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Abstract 


We introduce the concept of coverage risk as an error measure for density ridge 
estimation. The coverage risk generalizes the mean integrated square error to set 
estimation. We propose two risk estimators for the coverage risk and we show that 
we can select tuning parameters by minimizing the estimated risk. We study the 
rate of convergence for coverage risk and prove consistency of the risk estimators. 
We apply our method to three simulated datasets and to cosmology data. In all 
the examples, the proposed method successfully recover the underlying density 
structure. 


1 Introduction 

Density ridges [10, 21, 16, 6] are one-dimensional curve like structures that characterize high density 
regions. Density ridges have been applied to computer vision [2], remote sensing [20], biomedical 
imaging [1], and cosmology [5, 7]. Figure 1 provides an example for applying density ridges to 
learn the structure of our Universe. 

To detect the density ridges from data, [21] proposed the ‘Subspace Constrained Mean Shift 
(SCMS)’ algorithm. SCMS is a modification of usual mean shift algorithm [15, 8] to adapt to the 
local geometry. Unlike mean shift that pushes every mesh point to a local mode, SCMS moves the 
meshes along a projected gradient until arriving at nearby ridges. Essentially, the SCMS algorithm 
detects the ridges of the kernel density estimator (KDE). Therefore, the SCMS algorithm requires 
a pre-selected parameter h, which acts as the role of smoothing bandwidth in the kernel density 


estimator. 


Despite the wide application of the SCMS algorithm, the choice of h remains an unsolved problem. 
Similar to the density estimation problem, a poor choice of h results in over-smoothing or under¬ 
smoothing for the density ridges. See the second row of Figure 1. 

In this paper, we introduce the concept of coverage risk which is a generalization of the mean 
integrated expected error from function estimation. We then show that one can consistently estimate 
the coverage risk by using data splitting or the smoothed bootstrap. This leads us to a data-driven 
selection rule for choosing the parameter h for the SCMS algorithm. We apply the proposed method 
to several famous datasets including the spiral dataset, the three spirals dataset, and the NIPS dataset. 
In all simulations, our selection rule allows the SCMS algorithm to detect the underlying structure 
of the data. 
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Figure 1: The cosmic web. This is a slice of the observed Universe from the Sloan Digital Sky 
Survey. We apply the density ridge method to detect filaments [7]. The top row is one example 
for the detected filaments. The bottom row shows the effect of smoothing. Bottom-Left: optimal 
smoothing. Bottom-Middle: under-smoothing. Bottom-Right: over-smoothing. Under optimal 
smoothing, we detect an intricate filament network. If we under-smooth or over-smooth the dataset, 
we cannot find the structure. 


1.1 Density Ridges 

Density ridges are defined as follows. Assume Xi, • • • , Xn are independently and identically dis¬ 
tributed from a smooth probability density function p with compact support K. The density ridges 
[10, 17, 6] are defined as 

R = {xeK: V{x)V{xfVp{x) = 0, X 2 {x) < 0}, 

where V{x) = [v 2 {x)^ • • • Vd{x)] with Vj{x) being the eigenvector associated with the ordered 
eigenvalue Xj{x) (Xi{x) > • • • > Xd{x)) for Hessian matrix H{x) = VVp(x). That is, R is 
the collection of points whose projected gradient V{x)V{x)'^Vp{x) = 0. It can be shown that 
(under appropriate conditions), R is a. collection of 1-dimensional smooth curves (1-dimensional 
manifolds) in 

The SCMS algorithm is a plug-in estimate for R by using 

= |a; e K ; Vn{x)Vn{x)'^Vpn{x) = 0,X2{x) < o|, 

wherepn{x) = ^ KDE and Vn and A 2 are the associated quantities defined 

by Pn. Hence, one can clearly see that the parameter h in the SCMS algorithm plays the same role 
of smoothing bandwidth for the KDE. 
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2 Coverage Risk 


Before we introduce the coverage risk, we first define some geometric concepts. Let be the i- 
dimensional Hausdorff measure [13]. Namely, is the length of set A and jJ^ 2 {A) is the area 

of A. Let ^4) be the projection distance from point x to a set A. We define Ur and as 

random variables uniformly distributed over the true density ridges R and the ridge estimator 
respectively. Assuming R and Rn are given, we define the following two random variables 

Wn = d{UR, Rn), Wn = d{U ^^, R). ( 1 ) 

Note that Ur,U^ are random variables while i?, Rn are sets. Wn is the distance from a randomly 
selected point on R to the estimator Rn and Wn is the distance from a random point on Rn to R. 

Let Haus(A, B) = inf{r : A c 5 0 r, 5 c A 0 r} be the Hausdorff distance between A and B 
where A 0 r = {x : d{x, A) < r}. The following lemma gives some useful properties about Wn 

and Wn- 


Lemma 1 Both random variables Wn and Wn cire bounded by Haus(M^, M). Namely, 
0 < Wn < Haus(i?n,^), 0 < Wn < Haus{Rn, R)- 
The cumulative distribution function (CDF) for Wn and Wn are 


P(^n < r\Rn) 


fii ft n {Rn © rf 

III {R) 


nWn < r\Rn) 


Ml 


(^Rn n (i? © r)) 



( 2 ) 

(3) 


Thus, 'P{Wn < T\Rn) is the ratio of R being covered by padding the regions around Rn at distance 
r. 


This lemma follows trivially by definition so that we omit its proof. Lemma 1 links the random 
variables Wn and Wn to the Hausdorff distance and the coverage for R and Rn- Thus, we call them 
coverage random variables. Now we define the Ci and C 2 coverage risk for estimating R by Rn as 


Riski n = 


0 ILn) 


Risk 2 ,n = 


E{W^ + w^) 


(4) 


That is, Riski^^ (and Risk 2 ,n) is the expected (square) projected distance between R and Rn- Note 
that the expectation in (4) applies to both Rn and Ur. One can view Risk 2 ,n as a generalized mean 
integrated square errors (MISE) for sets. 


A nice property of Riski^^ and Risk 2 ,n is that they are not sensitive to outliers of R in the sense that 
a small perturbation of R will not change the risk too much. On the contrary, the Hausdorff distance 
is very sensitive to outliers. 


2.1 Selection for Tuning Parameters Based on Risk Minimization 


In this section, we will show how to choose h by minimizing an estimate of the risk. 


We propose two risk estimators. The first estimator is based on the smoothed bootstrap [24]. We 
sample XI ^ • • • A* from the KDE Pn and recompute the estimator i?*. The we estimate the risk by 




where W* = d{Us , i?*) and W* = d(?7s., Rn)- 
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The second approach is to use data splitting. We randomly split the data into • • • , and 
X 21 , • • • , X 2 ^, assuming n is even and 2m = n. We compute the estimated manifolds by using 
half of the data, which we denote as R\ ^ and Then we compute 


Riskj „ =- - -, Risk 2 _„ =- - - 

where wl^ = J i?t J. 


( 6 ) 


Having estimated the risk, we select h by 


/i* = argmin 


Risk^, 


(7) 


h<hn 

where hn is an upper bound by the normal reference rule [25] (which is known to oversmooth, so 
that we only select h below this rule). Moreover, one can choose h by minimizing £2 risk as well. 


In [11], they consider selecting the smoothing bandwidth for local principal curves by self-coverage. 
This criterion is a different from ours. The self-coverage counts data points. The self-coverage is 
a monotonic increasing function and they propose to select the bandwidth such that the derivative 
is highest. Our coverage risk yields a simple trade-off curve and one can easily pick the optimal 
bandwidth by minimizing the estimated risk. 


3 Manifold Comparison by Coverage 


The concepts of coverage in previous section can be generalized to comparing two manifolds. Let 
Ml and M 2 be an -dimensional and an ^ 2 -dimensional manifolds (^1 and ^2 are not necessarily 
the same). We define the coverage random variables 


W^2=d{UM,.M2), 




( 8 ) 


Then by Lemma 1, the CDF for W 12 and W 21 contains information about how Mi and M 2 are 
different from each other: 


nWi 2 <r) = 


(Min(M2 0r)) 


nW 2 i <r) = 


(M 2 n(Mi 0 r)) 


(-^l) /^r2 

¥{Wi 2 < r) is the coverage on Mi by padding regions with distance r around M 2 . 


(9) 


We call the plots of the CDF of W 12 and W 21 coverage diagrams since they are linked to the 
coverage over Mi and M 2 . The coverage diagram allows us to study how two manifolds are different 
from each other. When ^1 = ^ 2 , the coverage diagram can be used as a similarity measure for two 
manifolds. When ^1 7 ^ -^ 2 , the coverage diagram serves as a measure for quality of representing high 
dimensional objects by low dimensional ones. A nice property for coverage diagram is that we can 
approximate the CDF for W 12 and W 21 by a mesh of points (or points uniformly distributed) over 
Ml and M 2 . In Figure 2 we consider a Helix dataset whose support has dimension d = 3 and we 
compare two curves, a spiral curve (green) and a straight line (orange), to represent the Helix dataset. 
As can be seen from the coverage diagram (right panel), the green curve has better coverage at each 
distance (compared to the orange curve) so that the spiral curve provides a better representation for 
the Helix dataset. 


In addition to the coverage diagram, we can also use the following £1 and £2 losses as summary for 
the difference: 


L 0 SSi(Mi,M 2 ) 


E(ILi2 0VF2i) 
2 


LoSS 2 (Mi,M 2 ) 


E(ILf 2 +^ 2 l) 

2 


( 10 ) 


The expectation is take over Umi and UM 2 and both Mi and M 2 here are fixed. The risks in (4) are 
the expected losses: 


Riski^n = E ^Lossi(Mn, M)^ , Risk2,n = E ^Loss2(Mn, M)^ 


( 11 ) 


4 









o 


o 

d 


0.0 


0.2 


0.4 


0.6 


0.8 


1.0 


r 


Figure 2: The Helix dataset. The original support for the Helix dataset (black dots) are a 3- 
dimensional regions. We can use green spiral curves (d = 1) to represent the regions. Note that 
we also provide a bad representation using a straight line (orange). The coverage plot reveals the 
quality for representation. Left: the original data. Right: the coverage plot for the spiral curve 
(green) versus a straight line (orange). 

4 Theoretical Analysis 

In this section, we analyze the asymptotic behavior for the coverage risk and prove the consistency 
for estimating the coverage risk by the proposed method. In particular, we derive the asymptotic 
properties for the density ridges. We only focus on C 2 risk since by Jensen’s inequality, the £2 risk 
can be bounded by the £1 risk. 

Before we state our assumption, we first define the orientation of density ridges. Recall that the 
density ridge i? is a collection of one dimensional curves. Thus, for each point x e R, we can 
associate a unit vector e{x) that represent the orientation of at x. The explicit formula for e{x) 
can be found in Lemma 1 of [6] . 

Assumptions. 

(R) There exist > 0 such that for all x G 5r, 


Xi{x) >/3o - I3i, ||Vp(a:)||||p(^)(x)||max </3o(/?i -/32), (12) 


max 


where (x) Umax is the element wise norm to the third derivative. And for each x E R, 



(Kl) The kernel function Ff G BC^ and is symmetric, non-negative and 



for all a = 0,1, 2,3. 

(K2) The kernel function K and its partial derivative satisfies condition iTi in [18]. Specifically, 
let 



(13) 


We require that K satisfies 



(14) 
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for some positive number A, v, where N(T, e) denotes the e-covering number of the 
metric space (T, d) and F is the envelope function of K and the supreme is taken over 
the whole The A and v are usually called the VC characteristics of JC. The norm 
\\F\\L2iP) = supp J \F{x)\^dP{x). 

Assumption (R) appears in [6] and is very mild. The first two inequality in (12) are just the bound 
on eigenvalues. The last inequality requires the density around ridges to be smooth. The latter part 
of (R) requires the direction of ridges to be similar to the gradient direction. Assumption (Kl) is 
the common condition for kernel density estimator see e.g. [26] and [23]. Assumption (K2) is to 
regularize the classes of kernel functions that is widely assumed [12, 17, 4]; any bounded kernel 
function with compact support satisfies this condition. Both (Kl) and (K2) hold for the Gaussian 
kernel. 

Under the above condition, we derive the rate of convergence for the £2 risk. 

Theorem 2 Let Risk 2 ,n be the C 2 coverage risk for estimating the density ridges and level sets. 
Assume (Kl-2) and (R) and p is at least four times bounded differentiable. Then as n ^ 00 , h ^ ^ 


Risk 


2,n 




1 


nh^+‘^ 


for some Br and (j\ that depends only on the density p and the kernel function K. 


The rate in Theorem 2 shows a bias-variance decomposition. The first term involving is the 
bias term while the latter term is the variance part. Thanks to the Jensen’s inequality, the rate of 
convergence for Ci risk is the square root of the rate Theorem 2. Note that we require the smoothing 
parameter h to decay slowly to 0 by 0- This constraint comes from the uniform bound 

for estimating third derivatives for p. We need this constraint since we need the smoothness for 
estimated ridges to converge to the smoothness for the true ridges. Similar result for density level 
set appears in [3, 19]. 

By Lemma 1 , we can upper bound the £2 risk by expected square of the Hausdorff distance which 
gives the rate 

Risk 2 ,„ < E (Haus^Rn, R)) = 0{h^) + O (15) 

The rate under Hausdorff distance for density ridges can be found in [6] and the rate for density 
ridges appears in [9]. The rate induced by Theorem 2 agrees with the bound from the Hausdorff 
distance and has a slightly better rate for variance (without a log-n factor). This phenomena is 
similar to the MISE and Coo error for nonparametric estimation for functions. The MISE converges 
slightly faster (by a log-n factor) than square to the Coo error. 

Now we prove the consistency of the risk estimators. In particular, we prove the consistency for the 
smoothed bootstrap. The case of data splitting can be proved in the similar way. 


Theorem 3 Let Risk 2 ,n be the £2 coverage risk for estimating the density ridges and level sets. Let 
Risk 2 ,n be the corresponding risk estimator by the smoothed bootstrap. Assume (Kl-2) and (R) and 
p is at least four times bounded differentiable. Then as n ^ 00 , h ^ and 0, 

Risk2,n - Risk2,n Q 
Risk2,n 


Theorem 3 proves the consistency for risk estimation using the smoothed bootstrap. This also leads 
to the consistency for data splitting. 
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Figure 3: Three different simulation datasets. Top row: the spiral dataset. Middle row: the three 
spirals dataset. Bottom row: NIPS character dataset. For each row, the leftmost panel shows the 
estimated Ci\ coverage risk using data splitting. Then the rest three panels, are the result using 
different smoothing parameters. From left to right, we show the result for under-smoothing, optimal 
smoothing (using the coverage risk), and over-smoothing. 


5 Applications 

5.1 Simulation Data 


We now apply the data splitting technique (7) to choose the smoothing bandwidth for density ridge 
estimation. The density ridge estimation can be done by the subspace constrain mean shift algorithm 
[21]. We consider three famous datasets: the spiral dataset, the three spirals dataset and a ‘NIPS’ 
dataset. 

Figure 3 shows the result for the three simulation datasets. The top row is the spiral dataset; the 
middle row is the three spirals dataset; the bottom row is the NIPS character dataset. For each row, 
from left to right the first panel is the estimated £i risk by using data splitting. The second to fourth 
panels are under-smoothing, optimal smoothing, and over-smoothing. Note that we also remove the 
ridges whose density is below 0.05 x mdiXxPn{x) since they behave like random noise. As can be 
seen easily, the optimal bandwidth allows the density ridges to capture the underlying structures in 
every dataset. On the contrary, the under-smoothing and the over-smoothing does not capture the 
structure and have a higher risk. 


5.2 Cosmic Web 

Now we apply our technique to the Sloan Digital Sky Survey, a huge dataset that contains millions 
of galaxies. In our data, each point is an observed galaxy with three features: 
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Figure 4: Another slice for the cosmic web data from the Sloan Digital Sky Survey. The leftmost 
panel shows the (estimated) £i coverage risk (right panel) for estimating density ridges under dif¬ 
ferent smoothing parameters. We estimated the coverage risk by using data splitting. For the 
rest panels, from left to right, we display the case for under-smoothing, optimal smoothing, and 
over-smoothing. As can be seen easily, the optimal smoothing method allows the SCMS algorithm 
to detect the intricate cosmic network structure. 


• z: the redshift, which is the distance from the galaxy to Earth. 

• RA: the right ascension, which is the longitude of the Universe. 

• dec: the declination, which is the latitude of the Universe. 

These three features (z, RA, dec) uniquely determine the location of a given galaxy. 

To demonstrate the effectiveness of our method, we select a 2-D slice of our Universe at redshift 
2 ; = 0.050 — 0.055 with {RA, dec) G [200, 240] x [0,40]. Since the redshift difference is very tiny, 
we ignore the redshift value of the galaxies within this region and treat them as a 2-D data points. 
Thus, we only use RA and dec. Then we apply the SCMS algorithm (version of [7]) with data 
splitting method introduced in section 2.1 to select the smoothing parameter h. The result is given in 
Figure 4. The left panel provides the estimated coverage risk at different smoothing bandwidth. The 
rest panels give the result for under-smoothing (second panel), optimal smoothing (third panel) and 
over-smoothing (right most panel). In the third panel of Figure 4, we see that the SCMS algorithm 
detects the filament structure in the data. 


6 Discussion 


In this paper, we propose a method using coverage risk, a generalization of mean integrated square 
error, to select the smoothing parameter for the density ridge estimation problem. We show that 
the coverage risk can be estimated using data splitting or smoothed bootstrap and we derive the 
statistical consistency for risk estimators. Both simulation and real data analysis show that the 
proposed bandwidth selector works very well in practice. 

The concept of coverage risk is not limited to density ridges; instead, it can be easily generalized to 
other manifold learning technique. Thus, we can use data splitting to estimate the risk and use the 
risk estimator to select the tuning parameters. This is related to the so-called stability selection [22], 
which allows us to select tuning parameters even in an unsupervised learning settings. 


A Proofs 


Before we prove Theorem 2, we need the following lemma for comparing two curves. 


Lemma 4 Let Si, S 2 be two bounded smooth curves in Let 7ri2 : Si S 2 and 1121 S 2 ^ Si 
be the projections between them. For a ^ Si and b G S 2 , define gi{a) and g 2 {b) as the unit tangent 
vectors for Si and S 2 at a and b respectively. Assume Si and S 2 are similar in the following sense: 
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(SI) 7Ti 2 and 1121 are one-one and onto, 


(S2) the projections are similar: 


max<^ sup \\TTi2ix) - 7r2i^(a;)||, sup \\TT2iix) - T^ipx)\\ [ = 0(ei), 

IxeSi xeS 2 J 


(S3) the tangent vectors are similar: 


max<^ sup \gi{x)^ g2{'Ki2{x))\, sup |^2(^) ^i(7r2i(:^^))| \ =1 + 0(62), 
IxeSi xeS2 


(54) the length are similar: 

length(5'l) — length(5'2) = 0{es) 

with 61 , 62,63 being very small. LetXi = \\x — 7ri2{x)\\‘^dx and I 2 = \\y — T: 2 i{y)\\^dy. 

Then we have 

\Xi — X 2 I = V^O(ei) +^ 20(62 + 63 ). 

Moreover, if we further assume 

(55) the Hausdorff distance Haus(5'i, 5 ' 2 ) = 0 ( 64 ) is small, 

then for any function ^ 1 -^ M that has bounded continuous derivative, we have 

[ i{li{t))dt= [ ^(72(t))<it(l+ 0 ( 62 + 63 + 64 )). 

Jo Jo 


Proof. Since Si and S2 are two bounded, smooth curves. We may parametrized them by 71 : 
[0,1] Si and 72 : [0,1] S2 with 

= 51(71 W), 71 ( 0 ) =si, 
l2{t) = 52 (72(i)), 72(0) = S 2 = 7112(71(0)), 

where gi = iigi and g2 = £292 for ^1, ^2 being the length of Si and S2 and si one of the end point 
of Si . The constant £j works as a normalization constant since gj is an unit vector; it is easy to 
verify that 

\\gj{t)\\dt= [ lj\\gj{t)\\dt = 

Jo 

The starting point 52 G S2 must be the projection 7 ri 2 (si) otherwise the condition (SI) will not hold. 


Iength(5'j) = [ 
Jo 


Let 


^ 1 = [ Il7i(i) - 7112(71 (^))ll^c^6 ^ 2 =/ ||72(i) - 7i2i(72(i))||^c?^- (17) 

Jo Jo 


Then the goal is to prove Xi —I2 = 0 ( 6 i) + 0(62). 

Now we consider another parametrization for £'2• Let 72 • [0,1] 1-^ S2 such that 72(f) = 7ri2(71(f)). 

By (SI), 772 is a parametrization for S2. The parametrization 772(f) has the following useful proper¬ 
ties: 

772(0) = 7ri2(7l(0)) = 52, 

52(0 = 92{m{t))92{V2{t)f'l[{t) = 52(52(i))52(71l2(7l(i)))7l(7l(i))- 
By condition (S 3 ) and (S 4 ), we have 

52(7ii2(7i(^)))7i(7i(i)) = ^i52(7ii2(7i(i)))7i(7i(^)) 

= £ 1(1 + 0 ( 62 )) (19) 

= £ 2(1 + 0 ( 62 ) + 0 ( 63 )) 
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uniformly for all t G [0,1]. Now apply this result to r]2{t), we obtain that 

V 2 {t) = + 0 ( 62 ) + 0 ( 63 )). (20) 

Together with 772 ( 0 ) = 72 ( 0 ), we have 

sup = 0 ( 62 ) ^ 0 ( 63 ). ( 21 ) 

te[o,i] 


Now by definition of Xi and the fact that ( 72 (t)) = 71 (f), we have 

^ 1 = rii7iw-^ 12(71 
Jo 

= [ hi2im{t)) - V2{t)fdt 
Jo 

= [ \\T^ 2 iiri 2 it)) + 0 {ei)-ri 2 {t)\fdt by (S2) 
Jo 

= I'+ Vl| 0 (ei), 

where= 7 h 2 iim{t)) - ri 2 {t)fdt. 


( 22 ) 


Now we bound the difference between I 2 and I 2 . Let U be an uniform distribution over [0,1] and 
define h{x) : [0,1] 1-^ M as h{x) = ||7r2i(72(^)) — 72(^)||- Note that it is easy to see that h{x) has 
bounded derivative. Then, 

I 2 = E||7r2i(72((/)) - l2{U)f = Eh{U). (23) 

Since both 72 and ri 2 are parametrization for the curve 82 ,^ 2 ^ defined for all image of 772 . 

We define the random variable W = 7 ^^( 7 ? 2 (C^))- Then by definition of I 2 , 

X '2 = E||7r2i (%([/)) - n 2 {U)f = m{W). (24) 

Since sup^gjo,!] \\l 2{0 - '<l 2 {t)\\ = 0 {e 2 ) + ^(ea), we have = x + 0 (e 2 ) + 0(e3). 

Thus, the pw {t) —pu {t) = 0(e2) + 0 ( 63 ), where pw and pu are the probability density for random 
variable W and U. Since U is uniform distribution, p[/ = 1 so that 


This implies X 2 


Eh{W) = [ 
Jo 



h{t)pw{t)dt 

h{t){pu{t) + 0 ( 62 ) + 0 {e 3 ))dt 
h{t){l + 0 ( 62 ) + 0 {e 3 ))dt 


= E/i(f/)(l +0(62)+ 0 ( 63 )). 


^ 2(1 + 0 ( 62 ) + 0 ( 63 )). Therefore, by (22) we conclude 

Xi = X' + v^O(ei) 

= X 2 + \^0{ei) +X2(0(e2) + 0 ( 63 )), 


(25) 


(26) 


which completes the proof for the first assertion. 


Now we prove the second assertion, here we will assume (S5). Since ^ has bounded first derivative. 


[ i{'li{t))dt= ( 
Jo Jo 

7‘ 


^{'^i 2 {'yi{t)))dt{l + 0(Haus(S'i, S' 2 ))) 

^{ri 2 {t))dt{l + 0 {€ 4 ,)). 


(27) 
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Again, let U be the uniform distribution and W = 72 ^(^2 (^))- We now define the function 


h{t) = <^( 72 (t)) for t G [0,1]. Since both ^ and 72 are bounded differentiable, h is also bounded 
differentiable. Then it is easy to see that 



(28) 


^0 

Now by the same derivation of (25), we conclude 



(29) 


^0 

Thus, by (27) and (29), we conclude 



(30) 


which completes the proof. 

□ 

The following Lemma bounds the rate of convergence for the kernel density estimator and will be 
used frequently in the following derivation. 

Lemma 5 (Lemma 10 of [6]; see also [17]) Assume (K1-K2) and that log n/n <h^< hfor some 
0 < b < 1. Then we have 



(31) 


for /c = 0, • • • ,3. Moreover, 



(32) 


Proof for Theorem 2. Here we prove the case for density ridges. The case for density level 


set can be proved by the similar method. We will use Lemma 4 to obtain the rate. Our strategy is 
that first we derive Rn)‘^) and then show that the other part E(d(f/^ , i?)^) is similar to 


the first part. 

Part 1. We first introduce the concept of reach [14]. For a smooth set A, the reach is defined as 


reach (A) = inf{r : every point in A 0 r has an unique projection onto A.}. (33) 


The reach condition is essential to establish a one-one projection between two smooth sets. 
By Lemma 2, property 7 of [ 6 ], 



(34) 


for some constant A 2 . Note that Sr and /32 are the constants in condition (R). 

Thus, as long as Rn is close to R, every point on Rn has an unique projection onto R. Similarly, 
reach(Rn) will have a similar bound to reach(i?) whenever \\pn — p|| 2 max i^ small (reach only 
depends on fourth derivatives). Hence, every point on R will have an unique projection onto Rn. 
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The projections between R and will be one-one and onto except for points near the end points 
for R and Rn. That is, when ||pn — ^112 max sufficiently small, there exists R^ C R and Rl^ C Rn 
such that the projection between R^ and R}^ are one-one and onto. Moreover, the length difference 

length(i?) - length(i?’*') = 0{Haus{Rn, R)), 

length(i?^) — length(i?]^) = 0{Haus{Rn, R)). 

Note that by Theorem 6 in [17], 

Haus(^n, R) = 0{\\pn - p||2,max)- (36) 


(35) 


Let X e R^, and let x' = irp (x) G Rl^ be its projection onto Rn- Then by Theorem 3 in [6] (see 
their derivation in the proof, the empirical approximation, page 30-32 and equation (79)), we have 
x' -X = W 2 {x){gn{x) - g{x)){l + 0{\\pn - p||3,max))> (37) 

where 

W 2 {x) = N{x)Hpix)Nix) 

Hn{x) = N{xfH{x)N{x) 
and A^(x) is a d X (d—l) matrix called the normal matrix for Rdiix whose columns space spanned 
the normal space for R at x. The existence for N{x) is given in Section 3.2 and Lemma 2 in [6]. 
Thus, we have 

|2 


(38) 


E I 


{d{x, Rnf) = E (||x - a;'||2) = E \\W 2 {x){gn{x) - g{x))f + A„, 
where is the remaining term and by Cauchy-Schwartz inequality, 

A„ < E \\W 2 ix)ig^{x) - g{x))f 0(E||p„ - 


(39) 


Thus, 


E (d{x, Rn?) = E \\W 2 {x){gn{x) - g{x))f 


= E \\W 2 {x){gn{x) - E{gn{x)) + E{gn{x)) - g{x))f 4 

= Tr(Cov(W"2(^)5n(^))) + \\W 2 {x){Eignix)) - 9ix))? 


(40) 


1 


nh^+‘^ 


Tr(i;(x)) + h^b{x)'^ b{x) + o 


1 


nh^+‘^ 




where 


(41) 


I](a;) = W2{x)E{K)W2{x)pix), 
b{x) = c{K)W2{x)V{V^p{x)) 
are related to the variance and bias for nonparametric gradient estimation (T>{K)p{x) is the asymp¬ 
totic covariance matrix forp^ and c(7f) V(V^p(x)) is the asymptotic bias for p^). 'E{K) is a. matrix 

and c{K) is a scalar; they both depends only on the kernel function K. ^ + • • • + ^2 is 

the Laplacian operator. 

Now we compute E(d(f/R, 7?^)^)- Note that since the length difference between R and R^ is 
bounded by (35) and (36): 

¥{UReR? = l-0{E{\\pn-p\\ln..?) 

Note that we use Lemma 5 to convert the norm into probability bound. By tower property (law of 
total expectation), 

E{d{UR,Rn?) =EiE{d{UR,Rn?\UR)) 

= E{E{d{UR,Rn?\UR,UR e R?)¥{Ur e R? 

+ E{E{d{UR,Rn?\UR,UR i R?)¥{Ur i R? (43) 

1 


= E 


nh^+‘^ 


M^{UR))^h%{UR)H{UR) +0 


nh^+‘^ 
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Note that by (42), the contribution from ¥{Ur ^ R^) is smaller than the main effect in (40) so we 
absorb it into the small o terms. Defining = E{b{UR)^b{U r)) and (Ur))), we 

obtain 


E{d{UR,Rnf) 


Blh^ 


nh^+‘^ 


1 


nh^+‘^ 




(44) 


Part 2. We have proved the first part for the £2 coverage risk. Now we prove the result for 
E{d{U^ ,RY)\ this will apply Lemma 4. If we think of as Si and Rl^ as S 2 in Lemma 4, 
then 


E{d{UR,,Rif\X^,■■■ ,X„)= f hi{t)-T:i2(ni{t))fdt=Ii 

Jo 

E{d{Uf^uR^f\Xi,--- ,Xn) = h2{t)-Tr2l{72{t))fdt=l2. 


(45) 


Thus, E(d(f/^t, R^)‘^) is approximated by E((i(f/^t, if the ei, 62 , 63 in Lemma 4 is small. 

Here we bound ej . 


The bound for ei is simple. For all x G Si, let 0 be the angle between the two vectors vi = 
' 7 ri 2 (^) — X and V 2 = 7121 ~ property of projection, vi is normal to Rn at 7 ri 2 (x) 

and V 2 is normal to Rat x. Thus, by Lemma 2 properties 5 and 6 of [ 6 ], the angle 0 is bounded by 
0 {\\pn — pWs max)- ^ote that their Lemma proves the normal matrices N{x) and Nn{7Ti2{x)) are 
close which implies the canonical angle between two subspace are close so that 0 is bounded. Now 
by the fact that both ||7ri2(x) — x\\ and ||7r^i^(x) — x\\ are bounded by Haus(i?n, R), we conclude 
ei < Haus(^„,i?) x 6 > = 0{\\pn - 


For 62 , we will use the property of normal matrix N{x). Let Nn{x) be the normal matrix for R^ at 
X. By Lemma 2, properties 5 and 6 of [ 6 ], 

\\Nix)Nixf - = 0(Haus(E„,i?)) + 0(||p„ 

= 0(l|Pn-p|IW)- 

N{x)N{x)'^ is the projection matrix onto normal space; so the tangent vector is perpendicular to 
that projection. The bounds for the two projection matrix implies the bound to the two tangent 
vectors. Thus, 62 = 0(||Fn - p|| 3 ,max)- 


For 63 , since the smoothness for R^ is similar to R (the normal direction is similar by 62 ) and their 
Hausdorff distance is bounded by 0(||pn — p \\2 max)- The length difference is at the same rate of 
Hausdorff distance. Thus, we may pick 63 = 0 {\\pn — p\\2 max)- 

Let Xi = E{d{URi, RI^)‘^\Xi, • • • , Xn) andX 2 = E(d(f/^t, R^)‘^\Xi, - - - , Xn). By Lemma 4 and 
the above choice for ej , we conclude 

Jl = l2{l + 0{\\pn - P|l3,ma.)) + V^O(||p„ - p\\l%,^). (46) 

Thus, by tower property again (taking expectation over both side) and Lemma 5 E||p^ — pjjg = 

OOT + o(\/l&)=o(l). 

E{d{URuRi?)=mi)=ni2)+o{l)=E{d{U^uR^)^)+o{l). (47) 


Now since by (35) and the fact that EHaus(i?n, R) = o(l), we have 

E{d{UR,, Rif) = E(d(UR, i?„)2)(l + 0(1)) 
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Combining by (44), (47) and (48), we conclude 


Risk2,„ =- - -=- 

= E(d(l7jt,:Rnf)+o(l) (49) 

where = E{b{UR)'^b{UR)) and (7\ = E{Jr{E{UR))). Note that all the above derivation works 
only when 

E\\p„-p\\l^,^ = Oih^)+o(J^^'^ =0(1). (50) 

This requires h 0 and ^^^+6 0. which constitutes the conditions on h we need. 

□ 


Proof for Theorem 3 . Since we are proving the bootstrap consistency, we assume Xi, • • • , Xn 
are given. 

By Theorem 2, the estimated risk Riskn ,2 has the following asymptotic behavior 

Risk „,2 + +o()i^), (51) 

where 

Bl = E (bn{U^ )^n{U^ )|Xi, • • • ,X„) , 

) ^ " " X (52) 

?| = E(Tr(E„(f/^J)|Xi,-- - ,X„) 

with5„(x) = c{K)W 2 {x)V{W^Pn{x)) and£„(rc) = W 2 {x)E{K)W 2 {x)pn{x) from (41). To 
prove the bootstrap consistency, it is equivalent to prove that and converges to Br and cf\. 


Here we prove the consistency for Br. The consistency for aR can be proved in the similar way. 
We define the following two functions 

^.(x) = \\c{K)W2{x)X{X^Um^ ^ 33 ^ 

n{x) = \\c{K)W2{x)V{V^p{x))f. 

It is easy to see that Bj^ = E (^Qri{Up )\Xi, • • • , Xn^ and Bj^ = E {CI{Ur)). 

Similarly as in the proof for Theorem 2, we define Rl^ C Rn that has one-one and onto projection 
to R^ . By (42), we can replace Up by Upt and Ur by Ur^ at the cost of probability 0{h‘^) + 



Now we will apply Lemma 4 again to prove the result. Again, we think of R^ as Si and Rl^ as 
S 2 . Let U be an uniform distribution over [0,1]. Then the random variable Uri = ji{U) and 
Upt^ = 72 (^). Thus, 

E(n([/flt)) = ^ 11(71 E(7(C/flt)|^i,---,^n) iln(72(l))rfl- (54) 
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By the second assertion in Lemma 4, 


( 55 ) 


Jo 

— [ ^( 72(^))<^^(1 + 0(62) + 0(63) + 0(64)) 

Jo 

= [ l^(72(t))c^t(l + 0(||pn -p||3,max))- 

Jo 

Note that we use the fact that Haus{Rn, R) = 0{\\pn — p \\2 max)- Since Q only involves third 
derivative for the density p, we have ll^(^) ~ ^n(^)|| = 0{\\pn — p|| 3 ,max)- This implies 

f n{-f 2 {t))dt= f ^n(72W)<^^ + 0(||pn -p||3,max). (56) 

Jo Jo 


Now combining all the above and the definition for Bji, we conclude 

Bl=E(hn{U^J\X,,--- ,Xn) 

= E(fi„(C/^t)|Xi,--- ,X„) + 0 (Haus(i?„,i?)) 

= [ ^n{l 2 {t))dt + 0{Hau5{Rn,R)) (by (54)) 
Jo 

= [ ^{j 2 {t))dt + 0{\\pn - p\\ 3 ,mayi) (by (56)) 
Jo 

= E {^{Ujp)) + 0{\\pn - p||3,max) (by (55)) 

= E {^{Ur)) + 0{\\pn - p||3,max) 

~ Br + 0(11^72 — p||3,max)- 


Therefore, as along as we have \\pn — p|| 3 ,max = op(l), we have 

Similarly, we the same condition implies 

^R-(^R = Op{l). 

Now recall from (51) and Theorem 2, the risk difference is 


Risk 


n,2 


Risk „,2 = {Bi - Bl)h^ + 


nh^+‘^ 


+ o + o 


nh^+‘^ 


= op {h^) + op (by (58) and (59)). 


Since Theorem 2 implies Risk ^^2 = O + O ( ^^i +2 ). by (60) we have 

Riskn,2 - Riskn,2 


Risk 


n,2 


= Op(l) 


which proves the theorem. 

Note that in order (61) to hold, we need \\pn — p|| 3 ,max = c>p(l)- By Lemma 5, 

\\Pn - Plls.max = 0(p) + Op 


I logn ] 
V nh'^+^ j ' 


(57) 


(58) 

(59) 

(60) 


(61) 


(62) 


Thus, a sufficient condition to \\pn — p|| 3 ,max = op(l) is to pick h such that —> 0 and —> 0. 

This gives the restriction for the smoothing parameter h. 


□ 
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